Abstract

Background: Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, 
as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is 
not only different versions of annotations that need to be compared but also different sources of annotation of the 
same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation 
providers, prediction software developers, and end-users, who all need to assess what is common and what is 
different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison 
of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences 
between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with 
additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. 
Genome browser styled graphics embedded in these reports help visualize the genomic context of the annotations. 
Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models 
for subsequent focused analysis. 

Results: ParsEval is capable of analyzing annotations for large eukaryotic genomes on typical desktop or laptop 
hardware. In comparison to existing methods, ParsEval exhibits a considerable performance improvement, both in 
terms of runtime and memory consumption. Reports from ParsEval can provide relevant biological insights into the 
gene structure annotations being compared. 

Conclusions: Implemented in C, ParsEval provides the quickest and most feature-rich solution for genome annotation 
comparison to date. The source code is freely available (under an ISC license) at http://parseval.sourceforge.net/. 



Background 

Only a decade ago, annotating a eukaryotic genome required
years of extensive collaboration and millions of dollars of
investment. Since then, the rapidly falling cost of DNA
sequencing and the increased accessibility of gene prediction
software have placed genome sequencing and annotation
well within the reach of most single-investigator biology
laboratories. As a result, proliferation of distinct annotation
sets corresponding to the same genomic sequences
is becoming increasingly common. Annotation sets for
a particular genome can accumulate in a variety of sce- 
narios. When developing gene prediction software, it is 
common to test the software on a genomic region for
which a high-quality reference is available, running and 
re-running the software and comparing the resulting pre- 
dictions against the reference. Community groups pro- 
viding annotation for species- or clade-specific genomes 
typically release updated annotations following the ini- 
tial release. Affordable transcriptome sequencing provides 
individual labs with data to specifically improve annota- 
tions for particular genes of interest, for example with 
respect to alternative splicing. In each of these scenar- 
ios, multiple annotations associated with a common set of 
genomic sequences require comparative assessment. 

A variety of comparison methods exist, but none can 
fully address the growing needs of the community (see 
Table 1). Manual comparison approaches can trivially be 
ruled out as slow, tedious, error prone, and hopelessly 
unscalable. Although genome browsers have had a huge 
impact by making gene annotations accessible to a wide 
variety of scientists, they likewise do little to provide the
automation and precision needed in whole-genome annotation
comparisons. Large genome sequencing projects and centers have
certainly developed in-house scripts and pipelines over the years
to address this need. However, these pipelines are typically not
standardized, not openly shared, and do not migrate well.

Table 1 Annotation comparison methods

| Method | Pros | Cons |
|---|---|---|
| Manual comparison | minimal overhead | extremely tedious; error prone; unscalable |
| Genome browser | intuitive interface; visual assessment of individual loci | visual assessments imprecise; extensive overhead; little or no automation |
| Eval | detailed statistics; visual assessment of statistic distributions; scales fairly well for large data sets; can compare multiple predictions to a single reference | older software; relatively slow; only summary statistics are reported, while stats for individual loci are discarded |
| ParsEval | detailed statistics provided, not only as a summary but for individual loci as well; scales well for large data sets; fast, efficient, and portable | only capable of comparing a single pair of annotations |

Various approaches to comparing alternative sources of gene structure annotations, with a brief description of the associated pros and cons.

Tools such as the Eval package [1] and the GFPE pro- 
gram [2] represent some of the earliest efforts to provide 
a reusable, easy-to-use annotation comparison tool to the 
community. Eval in particular stands out based on the 
amount of detail provided by its reported comparison 
statistics and by the ability to visualize the distributions 
of these statistics. Eval takes as input annotation files in 
Gene Transfer Format (GTF) and calculates a rich set of 
descriptive statistics summarizing the differences between 
the annotations. Because whole-genome annotations typ- 
ically include thousands (or tens of thousands) of genes, 
these statistics are intended to condense the information 
into a comprehensive yet concise summary (at the resolu- 
tion of entire sequences or sets of sequences), facilitating 
targeted improvement of gene prediction software. Unfor- 
tunately, this condensing process discards large amounts 
of valuable information at the resolution of individual 
gene loci, making the tool unsuitable for analyses that 
target a particular gene, sets of genes, or gene loci with 
characteristics of interest from within a larger set of genes. 
Such locus-resolution comparisons are useful not only
to software developers and annotation producers, who
need to know whether their software has distinct advantages
or disadvantages (e.g., favoring long over shorter
gene models on average, or failing in untranslated region
(UTR) prediction), but also to specialists concerned with a
particular gene family or pathway, for whom they are of
primary interest.

Motivated by a need for genome-scale evaluations with 
locus-scale detail, we developed ParsEval, a program for 
comparing and analyzing distinct sets of gene structure
annotations for the same input sequences. The program
is designed to incorporate all of the benefits of existing 
methods while addressing their shortcomings. ParsEval 
identifies differences in exon/intron assignments and in 
coding sequence (CDS) and UTR designations, at both 
feature-level (exon, CDS segment, UTR segment) and 
nucleotide-level resolution. The output consists of a set 
of commonly used statistics that provide quantitative 
measures of agreement when comparing predicted gene 
structures against a standard reference [3-5]. This out- 
put is presented in a detailed report for each gene locus, 
supplemented with genome browser styled graphics to 
enable additional visual assessment and analysis of the 
annotations. The statistics are also presented in a single 
summary report that aggregates the statistics across all 
loci, providing a condensed high-level view of the similar- 
ity between the two sets of annotations. For gene loci that 
include alternatively spliced genes or overlapping genes 
(or both), ParsEval determines the optimal matching of 
reference transcripts to prediction transcripts, and addi- 
tionally reports any novel transcript predictions that have 
been identified. 

Implementation 

Overview 

ParsEval is a command-line tool for gene annotation com- 
parison and analysis. The program takes as input a pair 
of gene structure annotations corresponding to the same 
sequence (in GFF3 format [6]), analogous to two sepa- 
rate annotation tracks one might see in a genome browser. 
For comparison purposes, the first set of annotations 
is treated as the reference while the other is treated as 
the prediction, although ParsEval makes no assumptions 
regarding the respective quality of the two annotation 
sets. The output of the program is a set of reports con- 
taining common comparison statistics intended to high- 
light relevant similarities and differences between the two 
sources of annotation. 

ParsEval first loads the annotation data into memory,
identifies start and end coordinates for gene loci, and asso- 
ciates each gene annotation with a single locus. Next, 
the program does a comparative assessment of the gene 
annotations for each locus, calculating and storing a vari- 
ety of informative similarity statistics. Finally, ParsEval 
generates reports providing a detailed readout of these 
statistics. 

Implemented in ANSI C, ParsEval is fast, memory 
efficient, and portable, designed to run on all POSIX- 
compliant UNIX systems (Linux, Mac OS X, Cygwin, 
Solaris, etc.). Most of the analysis code is implemented 
with shared memory parallelization, providing additional 
performance gains when running on the multicore processors
that are becoming increasingly common in commodity
hardware. ParsEval's only external dependency is the
GenomeTools library [7], which provides an API for gen- 
erating annotation graphics with AnnotationSketch [8], as 
well as implementations of a variety of data parsers and 
dynamic data structures. 

Gene locus identification 

Comparative analysis of two sets of gene annotations 
requires determining how annotations from one set cor- 
respond to annotations from the other, as well as the 
genomic coordinates (the gene locus) that should be con- 
sidered in each comparison. For rare cases in which a 
single reference annotation and a single prediction anno- 
tation line up perfectly, determining the gene locus and 
the corresponding genes is trivial. However, in most 
cases this task is complicated by a variety of factors. For
example, a single gene prediction workflow may anno- 
tate multiple genes at a single location, so one must 
determine how to associate these annotations with cor- 
responding annotations from an alternative source. Fur- 
thermore, when one or more gene annotations from one 
source overlap with multiple annotations from another 
source, one must determine how to compare these gene 
annotations and which coordinates to include in the 
comparison. 

One common approach involves designating one set of 
annotations as the reference set and then using the coordi- 
nates of each reference gene annotation to define a distinct 
gene locus to serve as the basis for subsequent compari- 
son (see Figure 1). However, this approach is unfavorable 
for several related reasons. First, reference gene annota- 
tions that overlap are handled separately, when it makes 
more sense to associate them with the same locus and 
handle them together. Second, it forces a quality judg- 
ment between the two sets of annotations when their 
relative quality is often unknown. The two sets of anno- 
tations likely include complementary information, and 
unless there is a clear distinction in quality between the 
two, choosing one as a reference discards clearly related
information from the other. Third, relevant information
from predicted gene models that extend beyond the
boundaries of the corresponding reference annotation is 
ignored. 

Although ParsEval uses the terms reference and 
prediction to distinguish between the two sets of annota- 
tions, both are considered equally when identifying gene 
loci. Each gene annotation corresponds to a node in an 
interval graph G. There is an edge between two nodes G_i
and G_j if the corresponding gene annotations overlap (see
Figure 2). Each connected component in G then corre- 
sponds to a distinct gene locus, which we define as the 
smallest genomic region containing every gene annotation 
associated with the corresponding subgraph. Defining a 
gene locus in this way makes no assumptions as to the rel- 
ative quality of the two sets of annotations, and ensures 
that no potentially relevant data are discarded. Further- 
more, according to this definition each gene locus is inde- 
pendent, enabling the subsequent comparative analysis 
tasks to run in parallel. 
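
The locus definition above can be illustrated with a short sketch. This is not ParsEval's C implementation; the function name `find_loci` and the tuple layout are assumptions made for illustration, and strand and sequence identifiers are ignored. Because G is an interval graph, its connected components can be found without building the graph explicitly: sort the pooled gene annotations by start coordinate and merge every annotation that overlaps the locus currently being extended.

```python
def find_loci(genes):
    """Group gene annotations into loci (connected components of the
    interval graph) by sweeping over intervals sorted by start.

    genes: iterable of (start, end, label) tuples with 1-based closed
           coordinates on one sequence; reference and prediction genes
           are pooled and treated equally.
    Returns a list of (locus_start, locus_end, [labels]).
    """
    loci = []
    for start, end, label in sorted(genes):
        if loci and start <= loci[-1][1]:
            # Overlaps the locus being built: extend it and add the gene.
            prev_start, prev_end, members = loci[-1]
            loci[-1] = (prev_start, max(prev_end, end), members + [label])
        else:
            # No overlap with anything seen so far: start a new locus.
            loci.append((start, end, [label]))
    return loci


example = [
    (100, 500, "ref1"), (450, 900, "pred1"),   # overlap: one locus
    (1200, 1500, "ref2"),                      # reference gene with no prediction
    (2000, 2600, "pred2"), (2100, 2400, "ref3"), (2550, 3000, "ref4"),
]
loci = find_loci(example)
```

In this example `ref3` and `ref4` do not overlap each other, but both overlap `pred2`, so all three fall into a single locus spanning 2000 to 3000.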

Gene structure representation 

To facilitate analysis at each gene locus, ParsEval converts 
GFF3 annotations for each gene into a character string 
representing the annotated gene structure (a model vector).
This model vector is similar to a sequence in Fasta
format, except instead of using the alphabet {A, C, G, T}
to represent chemical composition at each nucleotide,
the alphabet {C, F, G, I, T} representing gene structure
is used: C for coding sequence, F for 5'-UTR, T for
3'-UTR, I for introns, and G for intergenic sequence.
Using this alphabet, each transcript can be represented 
by a single model vector. ParsEval uses these model 
vectors when comparing reference and prediction gene 
annotations. 
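
As a rough illustration of this encoding, the sketch below builds a model vector for a locus from already-labeled features. The function name, the dictionary layout, and the convention of writing intron positions as I are assumptions for this example, not ParsEval's internal data structures.

```python
def model_vector(locus_start, locus_end, transcripts):
    """Encode non-overlapping transcripts at a locus as one model vector.

    transcripts: list of dicts with keys
        'start', 'end' : transcript bounds (1-based, closed)
        'features'     : list of (start, end, symbol), symbol in 'C', 'F', 'T'
    Every position starts as 'G' (intergenic). Positions inside a
    transcript become 'I' (intron) and are then overwritten by the
    exonic features (CDS and UTR segments).
    """
    vec = ["G"] * (locus_end - locus_start + 1)
    for tx in transcripts:
        for pos in range(tx["start"], tx["end"] + 1):
            vec[pos - locus_start] = "I"
        for start, end, symbol in tx["features"]:
            for pos in range(start, end + 1):
                vec[pos - locus_start] = symbol
    return "".join(vec)


# A toy two-exon transcript at a 20-nucleotide locus.
toy = {
    "start": 3, "end": 18,
    "features": [(3, 4, "F"), (5, 8, "C"), (13, 16, "C"), (17, 18, "T")],
}
vector = model_vector(1, 20, [toy])
```

Here `vector` is `GGFFCCCCIIIICCCCTTGG`: two intergenic positions, a 5'-UTR, a coding segment, an intron, a second coding segment, a 3'-UTR, and two more intergenic positions.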

In many cases, a single pair of model vectors (one 
for the reference, one for the prediction) is sufficient 
to fully represent annotated gene structure at a given 
locus. This is certainly true when both the reference 
and the prediction annotate a single gene with a sin- 
gle mRNA product at the locus. But even if the ref- 
erence (or the prediction) annotates multiple genes or 
transcripts, non-overlapping annotations can be encoded 
in the same model vector and compared simultane- 
ously with corresponding annotations from the other 
data set. However, if either the reference or the pre- 
diction contains annotations for overlapping transcripts, 
either because of alternative splicing or because of over- 
lapping gene models, a single pair of model vectors is 
insufficient to represent the complete annotated gene 
structure at that locus. In these more complicated cases, 
the reference or the prediction or both will be associ- 
ated with multiple model vectors. Thus, the algorith- 
mic requirement is to represent all annotated transcript 
structures in the locus using the smallest number of
model vectors.

Figure 1 Associating Annotations with Gene Loci. The black bar provides a scale corresponding to a genomic region for which two sets of
annotations are available. Reference annotations for gene structure are represented with red glyphs, while prediction annotations are shown with
blue glyphs. Arrows indicate the strand of the gene annotation, and different levels of shading correspond to different gene structure features: dark
shading for coding sequence, medium shading for UTRs, and light shading for introns. Green brackets denote gene loci as determined by the
common practice of using only the genomic coordinates from reference gene annotations. Orange brackets denote gene loci as determined by
ParsEval, which takes into account both reference and prediction annotations when selecting distinct loci for comparison.

This problem reduces to a common problem in graph 
theory known as the maximal clique enumeration prob- 
lem [9]. We treat each transcript as a node in an undi- 
rected graph and place an edge between two nodes if 
the corresponding transcripts do not overlap (unlike the 
locus identification step, reference annotations and pre- 
diction annotations are handled separately in this step). 
Each maximal clique (maximal fully-connected subgraph) 
in this graph corresponds to a set of transcripts that do 
not overlap and can therefore be collapsed into a single 
model vector. ParsEval uses the Bron-Kerbosch algorithm 
[9] to enumerate all maximal transcript cliques, first for 
the reference and then for the prediction. A model vec- 
tor is generated for each clique, after which ParsEval 
compares all reference model vectors with all prediction 
model vectors. 
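
The clique enumeration step can be sketched as follows. This is the basic Bron-Kerbosch recursion without pivoting, written for illustration; the function names and the input layout (a dictionary mapping transcript IDs to coordinates) are assumptions, and ParsEval's own implementation may differ in its details.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def transcript_cliques(transcripts):
    """Enumerate maximal sets of mutually non-overlapping transcripts.

    transcripts: dict mapping transcript ID -> (start, end).
    Returns a list of frozensets of transcript IDs.
    """
    ids = list(transcripts)
    # Edge between two transcripts if and only if they do NOT overlap.
    neighbors = {
        t: {u for u in ids
            if u != t and not overlaps(transcripts[t], transcripts[u])}
        for t in ids
    }
    cliques = []

    def bron_kerbosch(r, p, x):
        # r: current clique, p: candidates, x: already-processed nodes.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            bron_kerbosch(r | {v}, p & neighbors[v], x & neighbors[v])
            p = p - {v}
            x = x | {v}

    bron_kerbosch(set(), set(ids), set())
    return cliques


# t1 and t2 overlap (e.g. alternative isoforms); t3 overlaps neither.
cliques = transcript_cliques({
    "t1": (100, 500), "t2": (300, 800), "t3": (900, 1200),
})
```

In this example two model vectors would be needed: one encoding t1 together with t3, and one encoding t2 together with t3.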



Comparative analysis of annotations 

Given a pair of equal-length model vectors representing a 
pair of gene structure annotations at a given locus, ParsE- 
val computes a variety of comparison statistics to measure 
the level of agreement between the pair of annotations. 
Calculated at different levels of resolution, these statistics 
provide a detailed assessment of similarity between the 
reference and the prediction. At the resolution of distinct 
annotation features, ParsEval calculates the sensitivity and 
specificity as described in [3], the F1 score as described
in [4], and the annotation edit distance as described in 
[5,10]. These statistics are calculated for exons, CDS seg- 
ments, and UTR segments. Note that for a prediction 
feature to be considered a true positive, ParsEval requires 
both the start and end coordinates to match the reference 
perfectly. 
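
A minimal sketch of the feature-level calculation, assuming features are given as (start, end) pairs and that a true positive requires both coordinates to match exactly. The function name is hypothetical. Following the usual gene prediction convention, specificity here is TP / (TP + FP), and the annotation edit distance is taken as one minus the mean of sensitivity and specificity; the cited references remain the authoritative definitions.

```python
def feature_stats(reference, prediction):
    """Compare two collections of (start, end) features of one type
    (e.g. exons) using exact coordinate matching."""
    ref, pred = set(reference), set(prediction)
    tp = len(ref & pred)    # predicted features identical to a reference feature
    fn = len(ref - pred)    # reference features that were missed
    fp = len(pred - ref)    # predicted features absent from the reference
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    f1 = 2 * sn * sp / (sn + sp) if tp else 0.0
    aed = 1 - (sn + sp) / 2
    return {"tp": tp, "fn": fn, "fp": fp,
            "sensitivity": sn, "specificity": sp, "f1": f1, "aed": aed}


ref_exons = [(100, 200), (300, 400), (500, 600)]
pred_exons = [(100, 200), (300, 410), (500, 600), (700, 800)]
exon_stats = feature_stats(ref_exons, pred_exons)
```

The second predicted exon ends 10 nucleotides past the reference exon, so it counts as both a false positive and a false negative, even though the two exons overlap almost entirely.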

Figure 2 Locus Identification Using a Gene Interval Graph. Red and blue nodes in this interval graph correspond to reference and prediction
gene annotations (respectively) as shown in Figure 1. Two nodes are connected by an edge if the corresponding gene annotations overlap. Each
connected component in the graph represents a distinct gene locus, defined as the smallest genomic region containing every gene annotation
associated with the corresponding subgraph. In this example, nodes representing five reference annotations and four prediction annotations are
shown. The four connected components in the graph correspond to four gene loci, for which precise genomic coordinates can be determined from
the associated genes (shown in orange brackets in Figure 1).

At the nucleotide-level resolution, ParsEval also calculates
the sensitivity, specificity, F1 score, and annotation
edit distance, as well as the simple matching coefficient
and the correlation coefficient as described in [3]. These
statistics are calculated for coding nucleotides (CDS) and
untranslated exonic nucleotides (UTR). Overall identity
at the nucleotide level, of which the simple matching
coefficient is a generalization, is also computed.
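
The nucleotide-level statistics can be sketched directly on a pair of model vectors. The function names are hypothetical, intron positions are written as I, and the formulas are the standard ones for a two-class comparison: the simple matching coefficient is (TP + TN) / N and the correlation coefficient is (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)).

```python
import math


def nucleotide_stats(ref_vec, pred_vec, symbols="C"):
    """Nucleotide-level agreement for one feature class.

    symbols: model vector characters counted as positive,
             e.g. "C" for CDS or "FT" for UTR.
    """
    if not ref_vec or len(ref_vec) != len(pred_vec):
        raise ValueError("model vectors must be non-empty and equal in length")
    tp = fn = fp = tn = 0
    for r, p in zip(ref_vec, pred_vec):
        in_ref, in_pred = r in symbols, p in symbols
        if in_ref and in_pred:
            tp += 1
        elif in_ref:
            fn += 1
        elif in_pred:
            fp += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tp / (tp + fp) if tp + fp else float("nan"),
        "smc": (tp + tn) / len(ref_vec),
        "cc": (tp * tn - fn * fp) / denom if denom else float("nan"),
    }


def identity(ref_vec, pred_vec):
    """Fraction of positions at which the two model vectors agree."""
    return sum(r == p for r, p in zip(ref_vec, pred_vec)) / len(ref_vec)


ref_vec = "GGFFCCCCIIIICCCCTTGG"
pred_vec = "GGGFCCCCCIIICCCCTTTG"
cds = nucleotide_stats(ref_vec, pred_vec, symbols="C")
overall = identity(ref_vec, pred_vec)
```

In this toy comparison the prediction extends the first coding segment by one nucleotide and shifts both UTR boundaries, so CDS sensitivity is 1.0, CDS specificity is 8/9, the simple matching coefficient for CDS is 0.95, and overall identity is 0.85.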

For complex loci requiring multiple comparisons, the 
locus report includes an aggregate summary of the simi- 
larity statistics at the locus level in addition to the reports 
for each individual comparison. This locus-level summary 
also includes the splice complexity statistic [5], which Par- 
sEval computes and reports for both the reference and the 
prediction at the locus level. 

Based on the computed statistics, each comparison is 
classified in terms of similarity. A comparison is classified 
as a perfect match if the model vectors (and by implication 
the annotated gene structures) are identical. A compari- 
son is classified as a CDS structure match if the compari- 
son is not a perfect match, but there is perfect agreement 
in terms of CDS structure. A comparison is classified as 
an exon structure match if there are differences in the cod- 
ing sequence that nevertheless preserve exon structure (as 
resulting from different start and/or stop codons). A com- 
parison is classified as a UTR structure match if there are 
differences in CDS and exon structure, but the UTR struc- 
tures are identical. All other comparisons are classified as 
non-matches. 

Note that, as with feature-level statistics, match classi- 
fications require perfect agreement. For instance, a pair 
of annotations may have very similar CDS structures, and 
this will be reflected in the nucleotide-level CDS statistics. 
However, if the CDS structures are not precisely identical, 
the comparison will not be classified as a CDS structure 
match. 
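
One possible reading of this classification scheme, expressed on model vectors, is sketched below. The function name and the exact tests used for each class are assumptions made for illustration (intron positions are written as I); ParsEval's own implementation may decide these cases differently.

```python
def classify(ref_vec, pred_vec):
    """Classify a comparison of two equal-length model vectors."""

    def mask(vec, symbols):
        # Keep only the symbols of interest; blank out everything else.
        return [c if c in symbols else "-" for c in vec]

    def exonic(vec):
        # True wherever the position lies in an exon (CDS or UTR).
        return [c in "CFT" for c in vec]

    if ref_vec == pred_vec:
        return "perfect match"
    if mask(ref_vec, "C") == mask(pred_vec, "C"):
        return "CDS structure match"
    if exonic(ref_vec) == exonic(pred_vec):
        return "exon structure match"
    if mask(ref_vec, "FT") == mask(pred_vec, "FT"):
        return "UTR structure match"
    return "non-match"


reference = "GGFFCCCCIIIICCCCTTGG"
labels = [
    classify(reference, "GGFFCCCCIIIICCCCTTGG"),  # identical
    classify(reference, "GGGGCCCCIIIICCCCTTGG"),  # 5'-UTR not predicted
    classify(reference, "GGFFFCCCIIIICCCCTTGG"),  # start codon shifted
    classify(reference, "GGFFCCCCCIIICCCCTTGG"),  # first coding exon extended
    classify(reference, "GGGFCCCCCIIICCCCTTGG"),  # CDS and UTR both differ
]
```

The five toy predictions fall, in order, into the five classes described above, from perfect match down to non-match.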

As comparison statistics are computed on a locus-by- 
locus basis, ParsEval also maintains a running total of all 
comparison counts (such as true positives and false pos- 
itives) from which the statistics are computed. When all 
loci have been considered, each comparison statistic is 
then recomputed using these running totals to provide an 
overall assessment of similarity. 
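
The distinction between recomputing a statistic from pooled counts and averaging per-locus values can be seen in a small example. The function name and the count dictionaries below are hypothetical and the numbers are invented for illustration.

```python
def pooled_stats(per_locus_counts):
    """Recompute sensitivity and specificity from counts summed over loci."""
    tp = sum(c["tp"] for c in per_locus_counts)
    fn = sum(c["fn"] for c in per_locus_counts)
    fp = sum(c["fp"] for c in per_locus_counts)
    return tp / (tp + fn), tp / (tp + fp)


counts = [
    {"tp": 9, "fn": 1, "fp": 0},   # large locus, nearly perfect agreement
    {"tp": 1, "fn": 1, "fp": 2},   # small locus, poor agreement
]
pooled_sn, pooled_sp = pooled_stats(counts)

# For contrast: the unweighted mean of the per-locus sensitivities.
mean_sn = sum(c["tp"] / (c["tp"] + c["fn"]) for c in counts) / len(counts)
```

The pooled sensitivity is 10/12 (about 0.83), whereas the unweighted mean of the per-locus sensitivities is 0.70. Pooling weights each locus by the number of features it contributes.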

Reporting comparison scores 

For each gene locus, comparison statistics are calculated
for each corresponding pair of reference and prediction
model vectors. If multiple comparisons are required at a
locus, however, statistics are not reported for every
comparison. Instead, the comparisons are ranked using the
previously described similarity statistics and are reported so
as to ensure that each transcript (or transcript clique) is
considered at most once. In cases where there is an unequal
number of reference and prediction transcripts (or transcript
cliques) associated with a particular locus, some
will be labeled as novel or unmatched transcripts, and the
corresponding statistics are not included in ParsEval's
reports.
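
The selection of comparisons to report can be sketched as a greedy pass over ranked pairs. This is a simplification under stated assumptions: the text does not specify the exact ranking criteria, so a single similarity score per pair is assumed here, and the function name and identifiers are hypothetical.

```python
def select_comparisons(scores):
    """Pick comparisons to report so that each reference and prediction
    clique appears at most once, taking the best-scoring pairs first.

    scores: dict mapping (ref_id, pred_id) -> similarity (higher is better).
    Returns (reported pairs, unmatched reference IDs, unmatched prediction IDs).
    """
    used_ref, used_pred, reported = set(), set(), []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (ref_id, pred_id), score in ranked:
        if ref_id in used_ref or pred_id in used_pred:
            continue  # one side has already been reported in a better pair
        reported.append((ref_id, pred_id, score))
        used_ref.add(ref_id)
        used_pred.add(pred_id)
    all_refs = {ref_id for ref_id, _ in scores}
    all_preds = {pred_id for _, pred_id in scores}
    return reported, sorted(all_refs - used_ref), sorted(all_preds - used_pred)


scores = {
    ("r1", "p1"): 0.95, ("r1", "p2"): 0.60, ("r1", "p3"): 0.40,
    ("r2", "p1"): 0.70, ("r2", "p2"): 0.80, ("r2", "p3"): 0.30,
}
reported, unmatched_refs, unmatched_preds = select_comparisons(scores)
```

With two reference cliques and three prediction cliques, two comparisons are reported and the remaining prediction clique (`p3`) is left unmatched, corresponding to a novel or unmatched transcript.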

ParsEval presents the comparison statistics in a col- 
lection of reports. The first is a single summary report 
providing the aggregated statistics for a high-level assess- 
ment of similarity, as is standard for tools of this kind. 
Additionally, ParsEval produces a dedicated comparison 
report for each individual locus. The detail provided by 
these locus-level reports is extremely valuable, and ParsE- 
val is the only tool of its kind that preserves and reports 
comparisons at this level. By default, ParsEval generates 
these reports in an easy-to-parse and easy-to-read text 
format. However, ParsEval can also generate the reports 
as hyperlinked HTML files to facilitate browsing and 
network-based distribution. Furthermore, ParsEval can 
supplement HTML reports with embedded PNG graph- 
ics providing a genome-browser-like view of each locus' 
genomic context and enabling visual assessment of the 
annotations. 

If more targeted reporting is desired, ParsEval also pro- 
vides some filtering features. Using a simple optional 
configuration file, the user can exclude some gene loci 
from the reports based on a variety of features: locus 
length, number of genes, number of transcripts, number 
of transcripts per gene, number of exons, and CDS length. 
No comparisons are performed for loci that are filtered
out, so these loci do not contribute to the reported aggregate
summary statistics and comparison classifications.

To facilitate integration of comparison reports with 
popular genome browsers such as GBrowse [11] and 
PlantGDB [12], ParsEval can generate an additional out- 
put file (in GFF3 format) containing the coordinates 
of each gene locus. These genome browsers commonly 
allow users to anonymously create private custom tracks 
with uploaded data, which provides the quickest mecha- 
nism for integration. Once a track is populated with the 
uploaded locus data, the user can edit the track configura- 
tion so that each locus feature in the track is hyperlinked 
to the corresponding ParsEval report, which may have 
been stored, for example, on that user's local machine (see
Figure 3). Alternatively, if a more permanent and public 
solution is desired, a user with administrative privileges 
for the genome browser can follow standard procedures 
for populating a new track with the GFF3 data and then 
configure the track so that locus features are linked to 
network-accessible ParsEval reports. 
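
For readers unfamiliar with GFF3, the sketch below writes locus coordinates as GFF3 feature lines suitable for upload as a custom track. It follows the nine-column GFF3 layout, but the source label, the feature type `region`, and the ID scheme are assumptions chosen for this example; the file ParsEval actually writes may use different values.

```python
def loci_to_gff3(loci, source="sketch"):
    """Format loci as GFF3 text.

    loci: list of (seqid, start, end) with 1-based closed coordinates.
    Columns: seqid, source, type, start, end, score, strand, phase, attributes.
    """
    lines = ["##gff-version 3"]
    for seqid, start, end in loci:
        attributes = f"ID={seqid}_{start}-{end}"
        fields = [seqid, source, "region", str(start), str(end),
                  ".", ".", ".", attributes]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


gff3_text = loci_to_gff3([("chr1", 100, 900), ("chr1", 1200, 1500)])
```

A genome browser track populated from such a file can then be configured so that each feature links to the report for the locus with the matching coordinates.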

Results and discussion 

We present several use cases to demonstrate ParsEval's
capabilities, benchmark its performance, and compare 
its utility relative to existing methods. The input data 
for these demonstrations were obtained from a variety 
of public databases with different respective formatting 
conventions. Accordingly, all data files were processed 
and converted to a uniform format before analysis. A
detailed description of this conversion process, along
with all code and commands used, is provided in
Additional file 1 as well as in ParsEval's source code
distribution.

Figure 3 Integrating ParsEval Reports with a Genome Browser. Screenshot of the Arabidopsis thaliana genome browser at Phytozome
(http://phytozome.net/), with a custom anonymous user track populated by ParsEval output. Boxes in this custom track represent loci identified by ParsEval
and are color-coded according to the level of agreement between the two sets of annotations compared (dark red and pastel blue glyphs,
respectively). This custom track can easily be configured so that features are hyperlinked to ParsEval reports containing detailed comparison statistics.

Unless otherwise noted, all use cases and benchmarks 
described herein were run on a fairly modest desktop
computer: a Mac Pro with two 2.8 GHz quad-core
Intel Xeon processors and 4 GB of RAM. ParsEval's
performance for these demonstrations should therefore be
fairly representative of the performance one might expect 
when running on commodity laboratory or personal 
hardware. 

Use case: predictions vs. gold standard 

High-quality gene structure annotations derived from
a combination of computational and experimental evidence,
and possibly improved with expert manual curation,
are indispensable as "gold standards" for measuring
the accuracy of a novel gene prediction method or
an entire new annotation workflow. Identifying differences
between the new method's predictions and such a gold
standard reference can help identify areas in which the novel
method provides or needs improvement. Reports from
ParsEval are effective for quickly and clearly identifying
such differences.

To demonstrate ParsEval in this context, we reproduced 
a comparison that was originally published to assess the 
performance of the AUGUSTUS gene prediction program 
[13]. In the original study, AUGUSTUS was tested on the 
h178 data set [14], a set of 178 human genomic sequences,
each containing a single gene, for which annotations were 
available from the EMBL database release 50 [15]. Gene 
predictions from AUGUSTUS were compared to the annotations
from EMBL, and sensitivity and specificity scores
were calculated at the nucleotide level, the exon level, and 
the gene level. 



We obtained the h178 data set (sequences and EMBL
r50 annotations) from [16]. We then used the latest ver- 
sion of AUGUSTUS (2.5.5) to generate gene predictions 
for the 178 sequences. The data files were reformatted 
and then compared using ParsEval. Running on a desktop 
computer, ParsEval generated graphical reports in less 
than a minute. The summary report provided immediate 
access to a variety of similarity metrics, including those 
reported in the original assessment. The sensitivity and 
specificity values reported by ParsEval are comparable to 
those reported in the original AUGUSTUS manuscript 
(see Table 2). Differences in the comparison metrics can 
likely be explained by improvements to the AUGUSTUS 
program since publication, although the exact reason is 
elusive because the original AUGUSTUS software is no 
longer accessible. 



Table 2 Use case: prediction vs. gold standard

| Statistic | AUGUSTUS manuscript | ParsEval comparison |
|---|---|---|
| Coding nucleotide sensitivity | 0.93 | 0.94 |
| Coding nucleotide specificity | 0.90 | 0.99 |
| Exon sensitivity | 0.80 | 0.81 |
| Exon specificity | 0.81 | 0.86 |
| Gene sensitivity | 0.48 | 0.43 |
| Gene specificity | 0.47 | 0.46 |

Sensitivity and specificity scores for AUGUSTUS gene predictions in comparison
to corresponding gene annotations from EMBL database release 50. The first
column shows scores as reported in the original AUGUSTUS manuscript. The
second column shows scores as computed by ParsEval using predictions from
the latest version of AUGUSTUS (2.5.5). Summary reports from ParsEval provide
immediate access to a wide variety of similarity statistics, including the ones
reported in this table. Differences between the scores reported by the
AUGUSTUS authors and the ParsEval authors are likely due to subsequent
updates of the AUGUSTUS program since its publication.

Use case: two sets of annotations

When working with genome annotations, there is an 
increasing variety of cases in which no gold standard 
is available for comparison. For example, gene annota- 
tions for many model species are available from a variety 
of sources (e.g., UCSC versus Ensembl). The respective
quality of these different annotation sets is not always 
clear, but comparison is still a necessary and fundamen- 
tal task. Another example relates to genome projects 
that typically offer multiple releases of gene annotations 
between each major genome assembly release. Although 
newer releases may offer marginal improvements over the 
older ones, neither one can truly be considered a high- 
quality standard reference for comparison. An additional 
example relates to the increased affordability of genome 
sequencing and the number of new and exotic species 
for which genome sequence is available. Gene annotation 
software is based on complex statistical models contain- 
ing many parameters, and it is not always initially clear 
which parameter values to use up front. Therefore, when 
annotating a newly sequenced genome, it is common to 
extract a subset of the genome on which to perform 
repeated optimization runs to determine the parameter 
values that should be used subsequently to annotate the 
entire genome. 

In each of these scenarios, multiple annotation sets must 
be compared, despite having no intuition as to the rel- 
ative quality of the respective annotations. ParsEval was 
designed precisely for this type of analysis. Reports from 
ParsEval provide both an overall summary and locus-level 
detail, enabling the user to make informed decisions about 
annotations for individual loci, as well as for annotation 
sets as a whole. 

As a demonstration of ParsEval's capability in this context,
we downloaded two recent gene annotation releases
(releases 64 and 65) for Mus musculus from the Ensembl
database [17]. We compared these annotations using ParsEval,
which required approximately 3 minutes of runtime
on a desktop computer. A brief review of ParsEval's summary
report shows that a total of 20,362 gene loci were
identified using these annotations (see Table 3 for a complete
breakdown). Of these gene loci, 6,725 had only
annotations from release 64.

ParsEval performed 23,590 comparisons, of
which 22,333 (94.7%) were perfect matches between
releases 64 and 65. A small number (83, 0.4%) of compar- 
isons were classified as UTR structure matches. For the 
remaining 1,174 comparisons (5.0%) that were classified 
as non-matches, transcripts from release 64 contained an 
average of 16.47 exons, whereas transcripts from release 
65 contained an average of 8.11 exons. A brief review of 
a handful of selected loci showed that many long tran- 
scripts (with many exons) that had been present in release 
64 were absent in release 65. 



This use case is an ideal demonstration of ParsEval's
capabilities. Although the authors have no prior experience
working with these particular data sets, a cursory
examination of ParsEval's reports clearly draws attention to
an important fact: between releases 64 and 65, changes
to Ensembl's annotation pipeline (perhaps different values
for parameters that influence joining/splitting annotations,
or implementation of stricter filters for gene length)
affected approximately 5% of the gene annotations. Not
only does ParsEval provide this information in a summarized
form, it also provides detailed locus reports enabling
users to scrutinize the results on a gene-by-gene basis.
This breadth and detail of information is of great benefit to
a wide variety of scientists and will empower them to more
fully understand the available data and make informed
decisions regarding alternative sources of annotation.

Benchmarks 

To demonstrate its speed, scalability, and efficiency, 
we benchmarked ParsEval by analyzing pairs of whole- 
genome gene structure annotations for four common 
model organisms representing a wide range of eukaryotic 
diversity: Arabidopsis thaliana (thale cress), Drosophila 
melanogaster (fruit fly), Glycine max (soybean), and Homo 
sapiens (human) (see Table 4). To give a detailed
demonstration of its performance, ParsEval was run 24 times
for each species: 3 technical replicates for each combination
of output mode (text and HTML/PNG) and number of
dedicated processors (1, 2, 4, and 8). Reported runtimes
were obtained by taking the mean of the 3 corresponding
replicates.
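
The benchmark design can be written out explicitly. The sketch below, in plain Python, enumerates the run matrix and shows how replicate timings would be averaged per configuration. The names `run_matrix` and `mean_runtimes` are hypothetical helpers, not part of ParsEval, and individual replicate timings are not reported in the paper, so none are included here.

```python
from itertools import product
from statistics import mean

SPECIES = ["A. thaliana", "D. melanogaster", "G. max", "H. sapiens"]
OUTPUT_MODES = ["text", "html"]   # "html" stands for the HTML/PNG output mode
PROCESSORS = [1, 2, 4, 8]
REPLICATES = 3


def run_matrix():
    """Enumerate every (species, mode, processors, replicate) combination."""
    return list(product(SPECIES, OUTPUT_MODES, PROCESSORS, range(1, REPLICATES + 1)))


def mean_runtimes(timings):
    """Average replicate runtimes for each (species, mode, processors) configuration.

    `timings` maps (species, mode, processors, replicate) -> runtime in seconds.
    """
    grouped = {}
    for (species, mode, procs, _replicate), seconds in timings.items():
        grouped.setdefault((species, mode, procs), []).append(seconds)
    return {config: mean(values) for config, values in grouped.items()}


runs = run_matrix()
runs_per_species = len(runs) // len(SPECIES)
print(f"{runs_per_species} runs per species, {len(runs)} runs in total")
```

Each species therefore has 2 output modes x 4 processor counts x 3 replicates, which accounts for the 24 runs stated above.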

Performance in text output mode 

ParsEval demonstrated optimal performance when running
in text output mode, with runtimes ranging from
about 30 seconds to about 5 minutes. Running ParsEval
in parallel on multiple processors provided noticeable
improvement in runtime for Drosophila and human,



Table 3 Use case: two sets of annotations

Comparison class          Count     Percent
Perfect matches           22,333    94.7%
CDS structure matches     0         0.0%
Exon structure matches    0         0.0%
UTR structure matches     83        0.4%
Non-matches               1,174     5.0%
Total comparisons         23,590    100.0%

Results from a ParsEval comparison of gene annotations for Mus musculus from
two recent releases of the Ensembl database (releases 64 and 65). Release 64
contains 22,507 gene annotations, while release 65 contains 14,486 gene
annotations. ParsEval identified 20,362 gene loci using these two data sets, 6,725
of which contained only annotations from release 64. For the 13,637 gene loci
for which both release 64 and 65 have annotations, 23,590 comparisons were
performed. Each of these comparisons was classified according to how well the
annotations from the two releases agreed. This table shows a breakdown of
these results.

Table 4 Benchmarks

                         A. thaliana        D. melanogaster    G. max             H. sapiens
Reference annotations    TAIR9              FlyBase 5.39       NCBI Entrez        UCSC knownGene (hg19)
Prediction annotations   TAIR10             Ensembl r65        JGI / Phytozome    Ensembl r65

Average runtime (sec)    Text     HTML      Text     HTML      Text     HTML      Text     HTML
  n = 1                  36.3     859.4     91.1     1,350.5   85.3     1,461.1   294.3    6,422.0
  n = 2                  32.8     449.2     56.6     859.5     79.4     768.4     181.3    4,089.5
  n = 4                  30.7     246.5     39.2     633.7     76.5     439.9     130.1    2,751.2
  n = 8                  29.8     168.7     32.4     546.6     76.3     330.5     108.0    2,323.3

Gene loci                25,618             10,976             47,877             17,865
  shared                 25,590             10,944             37,942             7,779
  unique to reference    6                  32                 3,363              9,569
  unique to prediction   22                 0                  6,572              517

Comparisons              33,002             22,474             38,734             16,168
  perfect matches        31,750   96.2%     22,446   99.9%     2,489    6.4%      2,517    15.6%
  CDS structure matches  420      1.3%      0        0.0%      17,450   45.1%     8,269    51.1%
  exon structure matches 8        0.0%      21       0.1%      26       0.1%      27       0.2%
  UTR structure matches  159      0.5%      1        0.0%      647      1.7%      58       0.4%
  non-matches            665      2.0%      6        0.0%      18,122   46.8%     5,297    32.8%

As a demonstration of ParsEval's speed and scalability, we obtained pairs of whole-genome annotations for Arabidopsis thaliana (thale cress), Drosophila melanogaster
(fruit fly), Glycine max (soybean), and Homo sapiens (human). For each organism, we used ParsEval to compare the two corresponding sets of annotations. Runtimes
are shown for both text and HTML/PNG output modes, using 1, 2, 4, and 8 processors (n). For each organism, we also show the number of gene loci identified, how many
were shared between the two sets of annotations, and how many are unique to one set. Finally, we show the number of reported comparisons for each organism and
how many were perfect gene structure matches, how many were CDS structure matches, and how many were non-matches. All of the results shown in this table were
easily obtained from the summary reports generated by ParsEval.



although no improvement was seen for Arabidopsis and
soybean. It is likely that for loci with relatively small and
simple gene structures, ParsEval's runtime is bound more
by serial I/O-related tasks than by actual analytical
computations, which would explain why no improvement was
observed for the plant species.

Performance in HTML output mode with PNG graphics 

Running ParsEval in HTML/PNG output mode increased 
the runtimes by an order of magnitude, although parallel 
processing kept these runtimes within a reasonable range 
(about 40 minutes for the most intensive comparison), with
observed speedup factors ranging from roughly 2.5 to 5 when using
all 8 processors. Because these improvements in runtime 
were observed for all species, it is likely that ParsEval's
runtime is bound primarily by computationally intensive 
graphics generation tasks when running in HTML/PNG 
output mode. 
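
Speedup factors can be recomputed from the mean runtimes reported in Table 4. The following sketch uses plain Python and the conventional definitions of speedup (single-processor runtime divided by n-processor runtime) and parallel efficiency (speedup divided by n). The names `RUNTIMES`, `speedup`, and `efficiency` are illustrative assumptions and not part of ParsEval.

```python
# Mean runtimes in seconds, copied from Table 4:
# {species: {output mode: {number of processors: seconds}}}
RUNTIMES = {
    "A. thaliana": {
        "text": {1: 36.3, 2: 32.8, 4: 30.7, 8: 29.8},
        "html": {1: 859.4, 2: 449.2, 4: 246.5, 8: 168.7},
    },
    "D. melanogaster": {
        "text": {1: 91.1, 2: 56.6, 4: 39.2, 8: 32.4},
        "html": {1: 1350.5, 2: 859.5, 4: 633.7, 8: 546.6},
    },
    "G. max": {
        "text": {1: 85.3, 2: 79.4, 4: 76.5, 8: 76.3},
        "html": {1: 1461.1, 2: 768.4, 4: 439.9, 8: 330.5},
    },
    "H. sapiens": {
        "text": {1: 294.3, 2: 181.3, 4: 130.1, 8: 108.0},
        "html": {1: 6422.0, 2: 4089.5, 4: 2751.2, 8: 2323.3},
    },
}


def speedup(times, procs):
    """Speedup relative to the single-processor runtime."""
    return times[1] / times[procs]


def efficiency(times, procs):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(times, procs) / procs


for species, modes in RUNTIMES.items():
    for mode, times in modes.items():
        print(f"{species:<16} {mode:<5} "
              f"speedup(8) = {speedup(times, 8):.2f}  "
              f"efficiency(8) = {efficiency(times, 8):.2f}")
```

Comparing the text and HTML rows for the same species makes it easy to see where additional processors help and where the runtime is dominated by work that does not parallelize.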

Notes on benchmark results 

The results of the A. thaliana benchmark were not sur- 
prising. Perfect matches and CDS matches account for 
97.5% of the comparisons, which makes sense consider- 
ing that TAIR10 represents minor cumulative updates to 
TAIR9. There were even fewer differences between FlyBase
and Ensembl annotations for the D. melanogaster
benchmark (0.1% of loci), suggesting perhaps that these
differences may be the consequence of technical artifacts
in one data set or the other.

The results of the other two benchmarks, for G. max 
and H. sapiens, were somewhat surprising. In each case, 
approximately 10% of the comparisons reflected perfect 
matches between the two annotations (6.4% for soybean 
and 15.6% for human), while approximately 50% of the
comparisons reflected CDS matches (45.1% for soybean
and 51.1% for human). Therefore, for the remaining
approximately 30% of human genes and 50% of soybean
genes, the annotated coding sequences (and associated 
polypeptides) are different depending on the annotation 
source. These differences are likely the result of differ- 
ent annotation strategies between alternative annotation 
approaches. Until the problem of gene structure predic- 
tion is completely solved, alternative approaches yielding 
alternative results will be inevitable. The ParsEval tool 
will aid both producers and users of gene structure anno- 
tations to quickly assess the extent and nature of the 
approach-based differences. 

Performance evaluation in comparison to Eval software 

To evaluate ParsEval's performance in comparison to
existing methods, we used the Eval tool [1] to repeat one
of the previously described use cases. Gene annotations
for Mus musculus were retrieved from releases 64 and
65 of the Ensembl database, and subsequently analyzed
using both Eval and ParsEval. Some small differences were
observed in the similarity statistics computed by the two
programs, although this was not unexpected, as Eval uses
a different approach than ParsEval for matching reference
annotations to prediction annotations. Also, the two
programs provide a different breakdown of the similarity
statistics, making a rigorous comparison between the Eval
results and the ParsEval results impractical.

Running Eval on the complete data sets exhausted the
desktop computer's memory resources after several minutes,
so comparison of Eval and ParsEval was only possible
after restricting the data sets to annotations for
M. musculus chromosomes 1 through 10. To analyze these reduced
data sets, Eval required an average of 12 minutes 13 seconds
and consumed all available memory. In contrast,
ParsEval, running on a single processor, required an
average of 1 minute 44 seconds, with memory consumption
peaking at approximately 0.5 GB. When run on 4
processors, ParsEval's performance margin increased, with
an average runtime of 47 seconds.

To ensure that Eval's performance was not being
severely affected by the desktop's limited system memory,
the comparison was also performed in a high-performance
computing environment in which memory
could not have been a limiting factor. ParsEval continued 
to demonstrate superior performance in this environment 
as well, although by a slightly less drastic margin. The Eval 
program required an average of 7 minutes 18 seconds of 
runtime, while ParsEval required an average of 1 minute 
19 seconds using a single processor, or 37 seconds using 4 
processors. 
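
The size of the performance margin can be made concrete by converting the reported runtimes to seconds and dividing. The sketch below is plain Python and uses only the averages stated above. The names `seconds`, `REPORTED`, and `fold_faster` are illustrative assumptions.

```python
def seconds(minutes, secs):
    """Convert a runtime given as minutes and seconds to seconds."""
    return 60 * minutes + secs


# Average runtimes reported for M. musculus chromosomes 1 through 10.
REPORTED = {
    "desktop": {
        "Eval": seconds(12, 13),
        "ParsEval, 1 processor": seconds(1, 44),
        "ParsEval, 4 processors": seconds(0, 47),
    },
    "HPC": {
        "Eval": seconds(7, 18),
        "ParsEval, 1 processor": seconds(1, 19),
        "ParsEval, 4 processors": seconds(0, 37),
    },
}


def fold_faster(runtimes):
    """Ratio of Eval's runtime to each ParsEval runtime in one environment."""
    eval_time = runtimes["Eval"]
    return {name: eval_time / t for name, t in runtimes.items() if name != "Eval"}


ratios = {env: fold_faster(runtimes) for env, runtimes in REPORTED.items()}

for env, by_config in ratios.items():
    for name, ratio in by_config.items():
        print(f"{env:<8} {name:<24} {ratio:.1f}x faster than Eval")
```

These ratios describe runtime only. On the desktop, the memory comparison (all available memory for Eval versus a peak of approximately 0.5 GB for ParsEval) is a separate consideration.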

These tests conclusively demonstrate two important 
points regarding the performance of ParsEval relative to 
Eval: not only is ParsEval markedly faster, but its resource 
efficiency also makes it much better equipped to run 
whole-genome comparisons on the laptop or desktop 
computers one might expect to see in the typical biology 
lab. The initial runtimes reported herein should be fairly 
representative of what users can expect to observe when 
running ParsEval on commodity hardware. 

Conclusions 

The accessibility of genome annotation tools to an
increasingly wide variety of scientists will soon be
accompanied by an increased demand for supplementary tools
to manage and analyze genome annotations. We address 
this need with ParsEval, a tool for fulfilling a common, 
fundamental analytical need for which existing software 
is lacking. ParsEval is a portable, easy-to-install, and
efficient program for comparing gene structure annotations,
and facilitates a wide variety of downstream
comparative analyses. We demonstrate the speed and
scalability of ParsEval, even when working with large
eukaryotic genomes. Furthermore, we show that the detailed
comparison statistics in ParsEval reports can
highlight relevant biological trends in the data.
We anticipate that ParsEval will enable a wide variety 
of biologists to more fully take advantage of the vast 
genome annotation data resources accumulating in their 
individual labs and in the community at large. 

Availability and requirements 

• Project name: ParsEval 

• Project home page: http://parseval.sourceforge.net 

• Operating system(s): POSIX-compliant UNIX 
systems (Linux, Mac OS X, Cygwin, Solaris, etc.) 

• Programming language: ANSI C 

• Other requirements: C compiler with OpenMP 
support (such as GCC 4.2 or higher); GenomeTools
library (http://genometools.org)

• License: ISC 

• Any restrictions to use by non-academics: none 



Additional file 



Additional file 1: Supplemental data. The file
StandageBrendel-7-6-12-SupplementalData.tar.gz is a gzip-compressed tar archive that stores a
self-contained web page. This page includes supplemental information for
users regarding the use cases and benchmarks described in the paper,
providing detailed instructions for obtaining the corresponding data and
code for carrying out the use cases and benchmarks.